
US006356545B1 

(12) United States Patent (10) Patent No.: US 6,356,545 B1

Vargo et al. (45) Date of Patent: Mar. 12, 2002 



(54) INTERNET TELEPHONE SYSTEM WITH 
DYNAMICALLY VARYING CODEC 

(75) Inventors: Mike Vargo, San Mateo; Jerry Chang, 
Los Altos, both of CA (US) 

(73) Assignee: Clarent Corporation, Redwood City, 
CA(US) 

( * ) Notice: Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 

(21) Appl. No.: 08/989,361 

(22) Filed: Dec. 12, 1997 

Related U.S. Application Data 

(63) Continuation-in-part of application No. 08/907,686, filed on 
Aug. 8, 1997, now Pat. No. 6,167,060. 

(51) Int. Cl.7 H04J 3/22

(52) U.S. Cl. 370/355; 370/468

(58) Field of Search 370/352, 353, 

370/354, 355, 356, 465, 468, 474, 477, 
401; 704/500, 501, 502, 503, 504 

(56) References Cited 

U.S. PATENT DOCUMENTS 

4,864,562 A * 9/1989 Murakami et al 370/538 

5,187,591 A 2/1993 Guy et al 358/425 

5,394,473 A 2/1995 Davidson 381/36 

5,533,004 A 7/1996 Jasper et al 370/11 

5,539,908 A 7/1996 Chen et al 395/700 

5,555,447 A 9/1996 Kotzin et al 455/72 

5,583,652 A 12/1996 Ware 386/75 



5,617,423 A 4/1997 Li et al 370/426

5,699,369 A 12/1997 Guha 371/41

5,742,773 A 4/1998 Blomfield-Brown et al 709/228

5,881,234 3/1999 Schwob 395/200.49

5,890,108 3/1999 Yeldener 704/208

5,933,803 8/1999 Ojala 704/223

5,940,479 8/1999 Guy et al 379/93.01

6,026,082 2/2000 Astrin 370/336

6,052,391 4/2000 Deutsch et al 370/540

6,064,653 5/2000 Farris 370/237

6,130,883 10/2000 Spear et al 370/328



* cited by examiner 



Primary Examiner — Kwang B. Yao 

(74) Attorney, Agent, or Firm— Carr & Ferrell LLP 

(57) ABSTRACT 

The invention is concerned with improvements in full 
duplex Internet telephone systems with a system architecture 
having low latency and permitting voice communication 
with telephone to telephone or PC to telephone connections. 
The architecture permits dynamic packet-to-packet change 
in codec to adjust for Internet conditions. The voice port 
creates self-describing packet conditions so that the higher 
level software of the system is independent of codec selec- 
tion. In addition to adjusting the codec, the voice port has the 
capability of dynamically and concurrently selecting other 
factors such as the level of error correction redundancy, the 
packet size and packet bundling on a packet-to-packet basis. 
The invention further includes a technique to eliminate dead 
air spaces in the voice data transmission stream by speeding 
up or slowing down the data rate in the buffer while 
maintaining a constant pitch of speech. 

26 Claims, 12 Drawing Sheets 
INTERNET TELEPHONE SYSTEM WITH
DYNAMICALLY VARYING CODEC 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This application is a Continuation-in-Part of U.S. patent
application Ser. No. 08/907,686, filed Aug. 8, 1997, now
U.S. Pat. No. 6,167,060 by Mike Vargo and Jerry Chang,
entitled, "Dynamic Forward Error Correction Algorithm for
Internet Telephone," which is hereby incorporated by reference.

BACKGROUND OF THE INVENTION 

1. Technical Field 

The present invention relates generally to an Internet
telephone system operating over a Public Switched Telephone
Network (PSTN), and more specifically to an Internet
telephone system with codecs that dynamically change from
packet to packet.

2. Discussion of the Prior Art 

The idea of sending telephone calls over the Internet is 
relatively new, brought on by the desire to avoid expensive 
long distance telephone bills from the major telephone 
companies. While the concept of sending audio and video 
data, e.g. cable programming, over the Internet has been 
known since 1990, it was not until 1993 that a program 
called Maven was created to transmit voice data using a 
personal computer. In 1995, VocalTec offered a beta test
version of its Windows-based Internet telephone, and that
same year DigiPhone offered a full duplex Internet telephone
system, which allowed users to talk and listen simultaneously.

Several problems must be addressed to make an Internet 
telephone product commercially successful. One of the most 
important is maintaining sound quality despite dropouts or 
gaps caused by the Internet. The digital nature of the Internet 
has theoretical advantages vis-a-vis analog networks, but 
when the Internet is busy a caller may have difficulty getting 
through to another party. Moreover, since the Internet is built 
to transfer data packets rather than continuous streams of 
sound, there may be delays and losses. 

For a telephone call to be placed over the Internet, the 
analog voice information must be converted into a digital 
format as a series of data packets that are communicated 
through the Internet's web of computers, routers and servers. 
Data compression algorithms are designed to prevent the 
customer from noticing delays between packets in the data 
stream. 

Analog voice messages spoken by customers are digitized 
and then compressed by a compression/decompression 
('codec') algorithm. There are at least ten different types of 
codecs, each designed to compress data optimally for a 
particular application. Some codecs use audio interpolation 
to fill in dropouts or gaps. Other codecs create high quality 
sound, but use complex algorithms that are slower to execute 
on a given computer. Still other codecs use faster compres- 
sion algorithms, but the sound quality is not as high. 
Whether the speed of the compression algorithm is impor- 
tant for a particular application may depend on the speed of 
the computer executing the algorithm. Different codecs use 
different compression ratios to compress data. For example, 
one codec might compress data by a factor of two from 8 
kbits/second to 4 kbits/second, while another codec might 
compress data by a factor of five from 8 kbits/second to 1.65 
kbits/second. Codecs exist that have data compression factors
of twelve, and even as high as fifty, but these require
more complex mathematical algorithms and the resultant
sound quality may depend on such things as the frequency
and computer connection. Exemplary codecs include GSM,
a European standard having a 5:1 compression ratio, and the
TrueSpeech codec (of DSP Group, Santa Clara, Calif.)
having a 15:1 compression ratio.
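
By way of illustration only, the compression factors quoted above can be checked with a few lines of arithmetic. The function names below are hypothetical and are not part of the described system; the bit rates are the ones given in the text.

```python
def compression_factor(input_kbps: float, output_kbps: float) -> float:
    """Ratio of the uncompressed bit rate to the compressed bit rate."""
    if output_kbps <= 0:
        raise ValueError("output bit rate must be positive")
    return input_kbps / output_kbps


def compressed_rate(input_kbps: float, factor: float) -> float:
    """Bit rate that remains after compressing by the given factor."""
    if factor <= 0:
        raise ValueError("compression factor must be positive")
    return input_kbps / factor


# The two examples from the text: 8 -> 4 kbits/second and 8 -> 1.65 kbits/second.
print(compression_factor(8.0, 4.0))
print(round(compression_factor(8.0, 1.65), 2))
```

The second example works out to a factor slightly below five, which the text rounds to "a factor of five."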

Prior art systems typically run only one codec at a time,
although the codec may be specified initially by the user
through adjustment of the computer settings or through
selecting the codec from a file menu. Codec programs at
both ends of an exchange must be able to understand one
another, so compatibility between codecs may also be an
issue.

Codecs do not address data dropouts or loss. Ordinarily, 
the Internet Protocol uses an Automatic Repeat Request 
(ARQ) to request retransmissions of lost messages, but 
voice transmission systems attempt to interpolate lost data 
rather than resend it. 

It is generally known in prior art Internet telephone
systems that codecs can be manually selected since both
parties must be using the same codec to understand one
another. U.S. Pat. No. 5,539,908 to Chen et al discloses a
system for dynamically linking codec algorithms between
file formats. While Chen et al supports a plurality of existing
and future codec installations, codecs are only changed
between file formats and not on a packet-to-packet basis.

U.S. Pat. No. 5,394,473 to Davidson discloses a device 
for coding and decoding of audio signals that optimizes 
between time and frequency resolution through selection of
the coder. However, Davidson is not concerned with an
Internet telephone system, nor changing the codec on a
packet-to-packet basis.

The television and radio industries employ speech compression
techniques in advertisement spots to minimize the
amount of advertising time paid for by sponsors. Such
techniques speed up the audio data while maintaining a
constant pitch or frequency for the voiceover. U.S. Pat. No.
5,583,652 to Ware provides a technique known as time
domain harmonic scaling for variable speed playback of an
audio/video presentation while keeping the audio and video
synchronized as well as the audio pitch undistorted. U.S.
Pat. No. 5,555,447 to Kotzin et al mitigates speech loss in a
communication system by buffering time-compressed
speech in a FIFO until the FIFO is substantially empty.
Thereupon, Kotzin et al transitions the communication system
from time-compressed to normal speech.

SUMMARY OF THE INVENTION

The present invention sets forth a novel Internet telephone
system architecture for providing full duplex operation with
low voice latency. The architecture enables a dynamic
change of codec from packet to packet in the same voice
data stream in order to adapt to changing network conditions.
The change of codec operates concurrently with a
change in other factors including the level of redundancy of
the error correction, the packet size and packet bundling.
The architecture thereby seeks to attain the best speech
quality and lowest latency given the level of data loss over
the Internet detected by the system.

One further feature of the present invention is a technique
for eliminating dead air space in the data stream by speeding
up or slowing down the data from the buffer while maintaining
a constant voice pitch.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an overview of an Internet telephone system
according to the present invention;

FIG. 2 details the main features of the gateway server of
FIG. 1;

FIG. 3 shows gateway server features in more detail, with
an emphasis on the software modules;

FIG. 4 shows the operation of these software modules in
relationship to establishing a call connection;

FIG. 5 is a flowchart of steps in connecting a call;

FIG. 6 shows the software modules of the gateway server
in further detail;

FIGS. 7(a) and 7(b) show the relationship between data
packets and frames;

FIGS. 8(a) to 8(d) illustrate the operation of the forward
error correction algorithm of the present invention;

FIG. 9 is a graph showing speech quality for main classes
of codecs;

FIG. 10 is a block diagram of an AbS codec model;

FIGS. 11(a) to 11(c) illustrate how the voice port changes
the codec and redundancy to maintain speech quality; and

FIG. 12 illustrates conceptually the operation of the time
warping speech algorithm.

DETAILED DESCRIPTION OF THE
PREFERRED EMBODIMENTS

FIG. 1 shows a general overview of the operation of the
Internet telephone system of the present invention. A call is
initiated in North America over a PSTN gateway server 10a
from a PSTN 11a. The server 10a supports either Telephone
to Telephone conversations or PC to Telephone connections.
FIG. 1 shows possible connections over the Internet 17 to
Tokyo, Japan; Osaka, Japan; and Taipei, Taiwan. In each of
these cities, a PSTN gateway server 10b-d is connected to
a PSTN 11a-c and the Internet 17 to receive calls. An
account manager (AM) 15 provides billing, monitoring and
authentication of these telephone services for up to 25
servers. The account manager 15 interacts with a relational
database 16, and is an intelligent network or service control
point. The account manager 15 can be attached at any point
on the network.

Referring to FIG. 2, each of the PSTN gateway servers 20
consists of a Public Switched Telephone Network 11 and a
gateway server 10. Each gateway 10 consists of a central
processing unit (CPU) 23, the Windows® Operating System
(OS) (not shown), gateway software 24, telecommunications
hardware (preferably G.723.1 TRAU from Natural
Microsystems of Framingham, Mass.) 25 and a Network
Interface Card (NIC) 26 connected by a bus. The gateway
operates on a "Wintel" platform, preferably with Windows
NT 4.0. The telecommunications hardware 25 supports
analog, T1 or Integrated Services Digital Network (ISDN)
connections to the PSTN 11, and the NIC 26 supports an
Internet Protocol (IP) such as TCP (Transmission Control
Protocol) or UDP (User Datagram Protocol) connection to
the Internet 17.

FIG. 3 shows the gateway server software utilities 24
which include modules called sessions 31, transport 32, a
plurality of teleports 33 and a User Interface (UI) 34. A bus
35 connects software utilities 24 to CPU 23, the Windows®
NT 4.0 Operating System 37, the telecommunications
hardware 25 and the NIC 26.

FIG. 4 illustrates the transport 32 receiving a call from the
Internet 17, and creating a session 31 to join the call to the
teleport 33. Turning now to a general description of how the
FIGS. 2, 3 and 4 gateway server 10 operates, assume there
is a call incoming from the PSTN 11. This incoming call
signals its arrival to the software 24 of the gateway server
10. Associated with each audio port of the gateway server 10
is an object port, called a telephone port, or teleport 33, FIG.
4, that waits for an incoming call. On the Internet side of the
gateway 10, there is another object called the transport 32.
Between the teleports 33 and the transport 32 is an object
called the sessions 31, which joins the ports on one gateway
to ports on another gateway. The session 31 is the
communication mechanism between teleports 33, and has two
functions: (1) managing IP network communication between
the incoming and outgoing ends of the server, and (2)
providing labeling and identifiers to indicate the conversation
endpoint. The sessions 31 finds an available connection
such that an incoming call from the telephone line 11 is
joined to an outgoing message on the Internet 17. Similarly,
if an incoming call arrives from the Internet 17, this call is
received by the transport 32 at the ingress side of the server
10, and the session 31 links this call to the
teleport 33 to produce an outgoing call on the PSTN 11.

The server 10 has software
associated with the teleports 33 and the transport 32. For
example, a teleport 33 has an echo suppressor for voice data
and also an encapsulating algorithm, to be discussed below.
The transport 32 contains similar software for data filtering
and correction.

FIG. 5 shows a flowchart of the gateway software process
for handling an outgoing (egress) call setup to illustrate the
operation of the transport. Beginning in step 51, an incoming
call is input to the transport 32, which listens for
TCP connections. In step 52, the transport 32 creates an
incoming session 31, and in step 53, this session 31
is bound to an outgoing or egress session 31 on the
remote side of the gateway 10. Then, in step 54, the egress
session 31 is bound to an available telephone port 33 on the
outbound side of the server 10. Step 55 indicates a steady
state condition where the call has been set up with a pair of
ports talking to one another.

The gateway server of the present invention supports both
telephone to telephone conversations as well as PC to
telephone conversations. Each server can accommodate up
to 24 simultaneous conversations. High quality voice
communication is established with low latency. The Gateway
system includes 10 Base T or 100 Base T network
connections, and has the ability to capture Dual Tone
Multifrequency (DTMF) tones from end users.

The teleport supports up to 16 different varieties of codec
algorithms for speech. A codec is a hardware or software
mechanism for converting analog voice signals to digital
signals and encoding the digital signals, and vice-versa. The
teleport is designed to be able to switch codecs between one
data packet and the next in the same data stream. Each data
packet is a self-describing package.

FIG. 6 is a system architectural diagram 60 of FIG. 4 in
detail. The voice port 61 receives incoming data
packets from the transport 32. Each transport 32 has many
voice ports 61. The voice port 61 has derived classes of the
wave port 53, which contains multimedia Application Program
Interfaces (API's), and the teleport 33. Teleport 33 is
connected to the PSTN through the line port 69 and to
transport 32 by the sessions 31. The voice port 61 contains
the codec algorithms 66. Among the different varieties of
codec are the TrueSpeech algorithm 67, Voxware 68, the null
speech algorithm and others.

The voice port 61 is responsible for three functions. First,
it provides forward error correction. Second, it provides an
algorithm for sending and regenerating speech. And third, it
provides for alignment and framing of data packets within
the buffer.
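
The following is a minimal sketch, not the patented implementation, of how a voice port might frame self-describing packets that also carry redundant copies of earlier data, in the manner of the pairs and triplets of FIGS. 8(b) and 8(c). The 3-byte header layout (codec id, redundancy level, sequence number), the function names, the codec id values and the one-byte frames are all assumptions made for illustration; the sketch also assumes fewer than 256 packets so that the sequence number fits in one byte.

```python
import struct

# Hypothetical 3-byte header: codec id, redundancy level, sequence number.
HEADER = struct.Struct("!BBB")
FRAME_SIZE = 1      # one byte per frame, mirroring the one-letter packets of FIG. 8
NULL_FRAME = b"0"   # stand-in for the null that pads the start of the stream


def build_packet(frames, n, codec_id, level):
    """Pack frame n behind `level` earlier frames (nulls before frame 0)."""
    window = [frames[i] if i >= 0 else NULL_FRAME for i in range(n - level, n + 1)]
    return HEADER.pack(codec_id, level, n) + b"".join(window)


def parse_packet(packet):
    """Read codec id and frames using only what the packet says about itself."""
    codec_id, level, seq = HEADER.unpack_from(packet)
    body = packet[HEADER.size:]
    frames = {}
    for j in range(level + 1):
        frame_no = seq - level + j
        if frame_no >= 0:  # skip the leading nulls
            frames[frame_no] = body[j * FRAME_SIZE:(j + 1) * FRAME_SIZE]
    return codec_id, frames


def receive(packets):
    """Rebuild the stream from whichever packets arrived; lost frames leave gaps."""
    got = {}
    for packet in packets:
        _, frames = parse_packet(packet)
        for frame_no, frame in frames.items():
            got.setdefault(frame_no, frame)
    return b"".join(got[i] for i in sorted(got))


text = b"This is a sentence."
frames = [text[i:i + 1] for i in range(len(text))]

# Level one for the first five packets, level three afterwards; the codec id
# also changes mid-stream. The receiver needs no side channel for either.
packets = [
    build_packet(frames, n, codec_id=1 if n < 5 else 2, level=1 if n < 5 else 3)
    for n in range(len(frames))
]
survivors = [p for n, p in enumerate(packets) if n not in {2, 6, 7, 8}]
print(receive(survivors))
```

Because every packet states its own codec and redundancy level, the receiving side can decode each packet in isolation, which is the property that lets the sender change those choices from one packet to the next.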
A data packet is contained within a frame, as shown in
FIGS. 7(a) and 7(b), and each layer of the network adds
some special formatting to the data packet within the frame.
Frame A (70) consists of packet 71 plus header 72 and trailer
73 information fields, H1 and T1, where header H1 (72) and
trailer fields T1 (73) are specific to the particular network
link. The data packet is the same between network
links, but is framed differently depending upon the network
link. For another network link, the data packet 71 is framed
in Frame B (74) with headers 75 and trailers 76, H2 and T2.
The process by which the network link substitutes its own
headers 72, 75 and trailers 73, 76 as the protocol for the data
packet 71 is called encapsulation.

Generally, a given message is not sent as a continuous
stream of information, but is broken up into blocks of data
packets having variable lengths. The process by which a
network link breaks up the data into packets is called
segmentation, and the process by which the packets are put
back together into a message at the receiving end is called
reassembly. There are a variety of reasons for segmenting a
message. First, a given network link only accepts messages
up to some fixed length. Second, errors are more readily
controlled, since it is not necessary to retransmit the entire
message if there is an error in only part of the message. An
error becomes more likely as the length of the message
increases. Third, the network is shared more equitably, and
one message does not monopolize the network, when the
messages are segmented.

One feature of the present invention is a forward error
correction algorithm for providing packet redundancy. The
basic problem is how to correct for certain packets of voice
information being lost as they are transported across the
Internet. Prior art approaches used interpolation to deal with
lost packets. In the present invention, lost data packets can
be recovered because these packets are duplicated downstream
in the data field.

Packet redundancy effectively slows the data transmission
rate because, due to replication, the information density is
not as high. A packet with a redundancy of level one is twice
as long as a packet with a redundancy of level zero, and a
packet with a redundancy of level two is three times as long
as a packet with a redundancy of level zero. Changing the
packet redundancy has some similarity to changing the
packet size or packet bundling, since the overall data stream
has a different length than before. But while changing the
packet size or bundling puts more information in each
packet, changing the packet redundancy does not. Still, even
at the expense of transmission capacity, it is advantageous to
provide redundancy in the data stream to eliminate voice
nulls due to lost data or dropouts and thereby improve voice
quality. Thus, a certain amount of transmission delay is
sacrificed for the overall success and integrity of the voice
transmission.

The level of data redundancy for the error correction
algorithm of the invention is between zero and three. That is,
the data is replicated in zero to three subsequent packets of
the message. The data stream of the message is sequenced,
and it is important to keep the sequence intact. When the
forward error correction algorithm is enabled, each data
packet of the speech segment is compared to the previous
data packet in the speech segment packet to determine
whether there is a voice null or gap in the sequence. In the
limiting case where the error correction algorithm is not
enabled, the level of redundancy is zero. If the error
correction algorithm is enabled and a voice null or gap is
detected in the sequence, then the algorithm regenerates the
lost packet through comparing the sequence numbers of the
received packets for redundant data.

As a general principle, the three level fault tolerance is
designed for marginal networks and can accommodate up to
four consecutive dropped packets. The number of dropped
packets varies according to a Poisson or similar type of
statistical distribution (e.g. Pareto), with the majority of
consecutive packet losses being in the range of one to four,
with few consecutive packet losses in the tails of the
distribution, i.e. numbering more than four consecutive
packet losses.

The particular error correction algorithm of the invention
is described in FIGS. 8(a) to 8(d). In these examples, each
box is assumed to be essentially one data packet, but for
purposes of illustration each of these packets is illustrated as
a letter of the alphabet. The grouping of data packets in pairs,
triplets or quadruplets in FIGS. 8(a) to 8(d) is for purposes
of illustration only; the data stream is continuous without
spaces between the groupings. In FIG. 8(a), the data stream
is illustrated as "This is a sentence." The data stream is
propagating from left to right in the drawing, so that the "T"
comes first, then the "h," then the "i," et cetera.

FIG. 8(b) shows the coding scheme for error correction
with a redundancy of level one. Conceptually, the data
stream is arranged as a series of pairs of data packets. The
last packet of each pair is repeated as the first data
packet of the next pair. Symbolically, for each packet N,
N_last → N_first + 1. The first packet of the first pair is
initialized with a null value to protect against loss of the first
data packet. Level one redundancy translates this into
duplicated data packet pairs of "0T", "Th", "hi", "is", and so on.

FIG. 8(c) illustrates the error correction algorithm of the
invention when the level of redundancy is level two. Here,
data packets are arranged in triplets. The algorithm is
constructed such that the last packet of the first triplet
becomes the middle packet of the next triplet, and the middle
packet of the first triplet becomes the first packet of the next
triplet. Symbolically, N_last → N_mid + 2, and
N_mid → N_first + 2. As before, the packets are initialized
with nulls to permit redundancy for the beginning packets in
the data stream. Since the packets are in triplets, there must
be nulls for the first two packets of the first triplet.
Therefore, the data stream "This is a sentence." is replicated
as "00T", "0Th", "Thi", "his", et cetera. Each new triplet
loses the first packet of the last triplet.

FIG. 8(d) illustrates the error correction scheme for
redundancy of level three. Here, the data packets are arranged
in quadruplets. The algorithm is constructed such that the
second packet in the first quadruplet is mapped to the first
packet in the second quadruplet; the third packet in the first
quadruplet is mapped to the second packet in the second
quadruplet; and the fourth packet in the first quadruplet is
mapped to the third packet in the second quadruplet; the first
packet in the first quadruplet is not repeated in the next
quadruplet. Symbolically, N_sec → N_first + 3;
N_thr → N_sec + 3 and N_four → N_thr + 3. In order to
prevent the loss of the first three data packets, a series of
nulls is added to the first three data packets. The first
quadruplet is initiated with three nulls, and these nulls are
transformed by the algorithm into two nulls in the second
quadruplet and one null in the third quadruplet.

More generally, for a redundancy of level k, for k=0 to L,
the algorithm provides that the i-th data packet is repeated k
times at positions (i+k)_j for j=1 to k.

One important feature of the forward error correction
algorithm of the invention is that the level of redundancy can
be dynamically varied from packet to packet within a data
stream. For example, one group of packets can have a level
one redundancy, the next group of packets can have a level
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three redundancy, and the following group of packets can Pulse Code Modulation (PCM) is the simplest form of 
have a level three redundancy. Selecting the level of redun- waveform coding. PCM involves sampling and quantizing 
dancy is one of the features performed by the voice port. the input waveform. Narrow-band speech is typically band- 
Level three redundancy can sustain three consecutive pack- limited to 4 kHz and sampled at 8 kHz. With linear 
ets losses by the Internet without the listener noticing a loss. 5 quantization, 12 bits per sample are required to obtain good 
Instead of changing the packet redundancy, the voice port s P ecc 1 J 1 ' anc | gives a bit rate of 96 kbits/sec. This 
can also dynamically vary the packet size or bundling. The bl } rale may b n e Tcdu ™* b ? usin S ™-™fonn quantization 
packet size may initially be 67 bytes, with 64 bytes of ^S^^^ " * ? S ' * Ioganlhmic 
information and a 3 byte header. The packet bundling may „ , 

be changing by bundling two 64 byte packets together with 10 °° 6 commonly used technique involves trying to predict 

a 3 bvte header to trive a 131 bvte riacket Or the nackel 3 Va,Ue ° f the DeXt Speech fr0m a P revl0us SP«*C° 

a 3 byte header to gwe a 131 byte packet. Or, the packet size fc _ Such coding is possible because there are 

could be changed from 64 bytes to 32 bytes of ^formation due to the ^ * J ecXs of , he vocal tract and 

to give a 35 byte packet, including a 3 byte header. Both the vocal chords in M h , es ^ errof sj j betweM 

packet size and packet bundhng can be changed by the voice ^ dicted fcs md the acmal k has a , ow 

port from packet to packet in the data stream to accommo- is variance when ^ ^tive codi ^ eff< £ ti and ft ^ 

date the loss characterises of the Internet at that particular mereby pogsMc , Q ^ ^ signal ^ fcwer ^ 

than a complete original speech signal. Differential Pulse 

Furthermore, not only does the voice port have the Code Modulation (DPCM) is an example of predictive 

capability of dynamically changing the redundancy, packet coding. 

size and packet bundling from packet to packet, but also the *° M a further reflnement) me redictor and tizer are 

voice port can similarly vary the codec algorithm from adaptive „ that m ch jn time t0 matcn , he charac . 

packet to packet. He packet is given selMescnbing infer- teristics of the h bein coded Such Ada , ive Diffcr . 

mation about what type of codec is needed at the receiver to entia] PCM (AD p CM ) codecs operating at 32 kbits/sec give 

decompress the packetTbe choice of codec at the trans- a n ^ similar to M kbits/sec PCM 

mitter may be derived from a complex function of choices a „, , , , , .. 

f i. jj i. j i.lji- Wavelorm codecs produce speech coding either in the 

of packet redundancy packet size and packet bundling. ^ Qf frequency fa Cq & ng ^ 

The voice port of the present invention can thus dynami- ^ speech ^ split ^ a number of fojq^cy bands? 

cally pick the speech compression algorithm, the data packet called 'sub-bands/ and each band is coded individually 

size, and the type of forward error correction to adapt to usiag> for examp le, a ADPCM coder. Each of the bands is 

network conditions. A complex feedback algorithm dictates then mdiv idually decoded and the plurality of bands are 

the various conditions under which the voice port adjusts re combined to reconstruct the speech signal at the receiver, 

these variables. The voice port can also select from several SBC has advantages because the noise in eacn sub . band is 

qualities of codec in response to possible conditions pre- dependent only on the coding in that sub-band. Therefore, a 

sented by the network. ^ greater number of bits are allocated to the most perceptually 

Generally speaking, the voice port increases the packet important sub-bands so that noise in these frequency bands 

redundancy when it detects a loss of information, and this is low, while a fewer number of bits are allocated to other 

implies that less information will be propagating in a given l ess perceptually important sub-bands so that noise in these 

packet stream. To accommodate the same quantity of infer- bands is higher. Sub-band Coding schemes produce toll 

mation through the limited bandwidth of a modem, the 4Q quality speech communication in the range of 16-32 kbits/ 

speech quality must be sacrificed. Therefore, a faster but sec. The filtering required to split the speech signals into 

lower speech quality codec algorithm is simultaneously sub-bands makes SBC more complex than DPCM, and 

implemented. The result is that the loss of data packets is correspondingly produces more delay, but the delay and 

compensated by the redundancy. complexity are still much less than that of hybrid coding. 

Speech compression is utilized to produce more compact 45 Another type of frequency domain waveform coding is 

representation of spoken sounds. The goal is that the recon- Adaptive Transform Coding (ATC). This technique uses a 

structed speech is perceived to be close to the original fast mathematical transform (e.g. discrete cosine transforms) 

speech. The two main measures of this closeness are intel- to split blocks of speech into a large number of frequency 

ligibility and naturalness. The standard reference point is bands. The number of bits used to code each transform 

called "toll quality speech," which is the speech quality that 50 coefficient is adapted depending upon the spectral properties 

is expected over a standard telephone line. of the speech. Toll quality reproduced speech is achieved 

As shown in FIG. 9, conventional speech coding tech- with bit rates as low as 16 kbits/ sec with ATC. 

niques broadly fit within three classes: (1) waveform codecs, In contrast to waveform coders, source coders possess 

(2) source codecs, and (3) hybrid codecs. Waveform codecs information about how the signal is produced. Source coders 

operate at high bit rates and yield very good speech quality. 55 generate a model of the source of the signal, and extract 

Source codecs operate at very low bit rates and produce parameters of the model from the signal. The model param- 

speech with a synthetic sound quality. Hybrid codecs use eters are given to the decoder. 

combinations of techniques from both source and waveform Source coders for speech are called 'vocoders.* First, the 

coding, and give good speech quality at intermediate bit vocal tract is represented as a time-varying filter and is 

rates * 60 excited with a white noise source for unvoiced segments, or 

Waveform codecs attempt to reconstruct the signal with- by a train of pulses separated by a pitch period for voiced 

out including any knowledge or information about how the segments. Parameters sent to the decoder include a filter 

original signal is produced. Waveform codecs therefore also specification, a voiced/unvoiced flag, the variance of the 

work well with non-speech sound signals. Generally, these excitation signal and the pitch period of the voiced speech, 

codecs have low complexity and produce high quality 65 These parameters are updated every 10-20 milliseconds, 

speech at rates above 16 kbits/sec. The reconstructed speech The model parameters are determined in the encoder either 

quality degrades rapidly below this bit rate. in the time or frequency domains. 
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Vocoders operate at 2.4 kbits/sec or below, and produce 
speech reproduction that is intelligible but not natural sound- 
ing. Therefore, vocoders are mainly applicable for military 
uses where natural sound is not important and the speech is 
encrypted. The simple source coder model of speech repro- 
duction does not make increasing the bit rate worthwhile. 
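
As a minimal sketch of the source-coder model described above, the following Python fragment drives an all-pole filter with either white noise or a pulse train. It is an illustration only: the function names, the parameter dictionary, and the 160-sample frame length (20 ms at an assumed 8 kHz sampling rate) are hypothetical and are not taken from any particular vocoder.

```python
import random


def make_excitation(num_samples, voiced, pitch_period, gain, seed=0):
    """Pulse train for voiced frames, white noise for unvoiced frames."""
    if voiced:
        # One pulse every pitch_period samples, zero elsewhere.
        return [gain if n % pitch_period == 0 else 0.0
                for n in range(num_samples)]
    # Gaussian white noise; gain is used as the standard deviation.
    rng = random.Random(seed)
    return [rng.gauss(0.0, gain) for _ in range(num_samples)]


def all_pole_filter(excitation, coeffs):
    """Vocal tract model: s(n) = u(n) + sum_i a_i * s(n - i)."""
    out = []
    for n, u in enumerate(excitation):
        acc = u
        for i, a in enumerate(coeffs, start=1):
            if n - i >= 0:
                acc += a * out[n - i]
        out.append(acc)
    return out


def synthesize_frame(params, num_samples=160):
    """Decoder side: rebuild one frame from the transmitted parameters.

    params holds the filter specification, the voiced/unvoiced flag,
    the excitation gain and the pitch period (hypothetical layout).
    """
    excitation = make_excitation(
        num_samples, params["voiced"], params["pitch_period"], params["gain"])
    return all_pole_filter(excitation, params["filter"])
```

Only the handful of values in `params` would be transmitted per frame, which is why such a model can reach very low bit rates while sounding synthetic.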

Hybrid codecs attempt to bridge the gap in the range of 
2.4 kbits/sec to 16 kbits/sec between waveform and source 
codecs. The most common hybrid codec is Analysis-by- 
Synthesis (AbS). The "Analysis-by-Synthesis" designation 
means that the encoder analyzes the input speech by syn- 
thesizing many approximations to it. This codec uses the 
same linear predictive filter model of the vocal tract as LPC 
vocoders, but employs excitation signals to match the recon- 
structed speech waveform to the original speech waveform. 
AbS codecs include the Multi-Pulse Excited (MPE) codec,
the Regular-Pulse Excited (RPE) codec, and the Code-
Excited Linear Predictive (CELP) codec. The pan-European
GSM mobile telephone system uses a simplified RPE codec 
with long-term prediction, operating at 13 kbits/sec to pro- 
vide toll quality speech. 

The basic AbS model of coders 100 and decoders 150 is 
shown in FIG. 10. AbS codecs begin by splitting the input 
speech signal s(n) 101 into coded frames that are about 20 
milliseconds long. Parameters are then determined for a 
synthesis filter 110 belonging to each frame, and an excita- 
tion signal 121 is determined for this filter 110. The synthesis 
filter 110 is designed to model correlation introduced into 
the speech by the vocal tract. The excitation signal u(n) 121 
is defined to minimize the error between the input 101 and 
reconstructed speech 140 when the excitation signal u(n) 
121 is passed into the synthesis filter 110. Encoder 100 
transmits information representing the synthesis filter 110 
analysis parameters and the excitation signal u(n) 121 to the 
decoder for each frame; the corresponding excitation signal 
u(n) 121 is passed through the synthesis filter 110 at the 
decoder 150 to obtain the reconstructed speech ŝ(n) 140.
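
The closed-loop search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the encoder of any specific codec: the candidate excitation list, the function names, and the plain (unweighted) squared-error measure are all hypothetical choices.

```python
def synthesize(excitation, coeffs):
    """All-pole synthesis filter: s(n) = u(n) + sum_i a_i * s(n - i)."""
    out = []
    for n, u in enumerate(excitation):
        acc = u + sum(a * out[n - i]
                      for i, a in enumerate(coeffs, start=1) if n - i >= 0)
        out.append(acc)
    return out


def squared_error(a, b):
    """Sum of squared sample differences between two equal-length frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def abs_search(target, coeffs, candidates):
    """Try every candidate excitation and keep the closest match.

    Returns (index, error) for the candidate whose synthesized output
    has the smallest squared error against the target frame.
    """
    best_index, best_error = None, float("inf")
    for index, excitation in enumerate(candidates):
        err = squared_error(target, synthesize(excitation, coeffs))
        if err < best_error:
            best_index, best_error = index, err
    return best_index, best_error
```

The encoder would transmit the filter parameters and the winning excitation, so the decoder can repeat the same synthesis step to reconstruct the frame.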

Generally, the synthesis filter 110 is an all-pole, short-
term, linear filter of the form:

$$H(z) = \frac{1}{A(z)}$$

where

$$A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$$

is the predictive error filter determined by minimizing the
energy of the residual signal produced when the original
speech segment is passed through it. The filter order p is
about 10.
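
One common way to obtain the coefficients a_1 ... a_p of a prediction error filter A(z) = 1 - sum_i a_i z^-i is the autocorrelation method solved with the Levinson-Durbin recursion. The text does not say which method is used, so the sketch below is an assumption offered for illustration, with hypothetical function names.

```python
def autocorrelation(frame, max_lag):
    """r(k) = sum_n frame[n] * frame[n - k] for k = 0..max_lag."""
    return [sum(frame[n] * frame[n - k] for n in range(k, len(frame)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err) where a = [a_1, ..., a_p] and err is the
    remaining residual energy after order-p prediction.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (e.g. all-zero) frame
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err


def residual(frame, coeffs):
    """Pass the frame through A(z): e(n) = s(n) - sum_i a_i * s(n - i)."""
    out = []
    for n in range(len(frame)):
        pred = sum(a * frame[n - i]
                   for i, a in enumerate(coeffs, start=1) if n - i >= 0)
        out.append(frame[n] - pred)
    return out
```

For a frame generated by a first-order all-pole filter with coefficient 0.9, the recursion recovers a_1 close to 0.9 and the residual collapses to the original impulse.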

Furthermore, the synthesizer sometimes includes a pitch 
filter to model the long-term periodicities in the voiced 
speech. These long-term periodicities are also exploited by 
including an adaptive codebook in the excitation generator
so an excitation signal u(n) includes a component G u(n - α),
where α is the estimated pitch period. MPE and RPE codecs
generally operate without a pitch filter, although a pitch filter
improves performance. However, a pitch filter is very impor- 
tant for CELP codecs. 

Error weighting block 135 shapes the spectrum of error 
signal e_w(n) 136 in order to reduce subjective error loudness
by utilizing the fact that error signal e_w(n) is partially
masked by high energy speech. This weighting produces a 
significant improvement in the subjective quality of the 
reconstructed speech for AbS codecs. 

The dynamic speech codec selection in the voice port of
the present invention is distinguishable from prior art codec
selection techniques. The present codec selection is not
"adaptive" in the sense of ADPCM where a parameter such
as the predictor and/or quantizer is "adapted" to match the
speech segment characteristics within a single codec. More
complexity and mathematical sophistication are required to
adapt the codec than to change to another codec entirely.
Neither is the dynamic characteristic of the present invention
concerned with classifying speech segments in the time
domain and then applying differential coding to each segment
as in CELP; an entirely different type of codec is
applied between speech segments in the instant device.

The present invention is furthermore distinct from a 
device where the codec is manually selected by the users. 
While a given internet telephone system optionally may
include a plurality of codec types that are selected from a
menu, as in Chen et al., supra, these systems do not auto-
matically and dynamically change the codec on a packet-
to-packet basis in response to system-detected changes in
voice quality. In the prior art system of Chen et al., the user
must manually select one or the other of the codecs from a
menu, and the codec type does not automatically change
from one speech segment to another speech segment in the 
same conversation. While any given codec often contains an 
adaptive feature that changes a parameter, the given codec 
type itself remains a constant. 

Most codecs in use for internet telephone systems are
hybrid codecs such as GSM (Global System for Mobile
Communications), which uses a Regular Pulse Excited (RPE)
codec. In RPE, the input speech is segmented into 20
millisecond frames, and a set of eight short term predictor
coefficients is found for each frame. Each frame is further
split into four 5 millisecond sub-frames, and the encoder
finds a delay and gain for the codec's long term predictor for
each sub-frame. The residual signal for both the short and
long term filtering of each sub-frame is quantized. The level
of sophistication of other commercially available hybrid
codec algorithms is comparable to GSM.
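
The frame arithmetic above can be made concrete with a short sketch. The 8 kHz sampling rate is an assumption (it is the usual telephone-band rate but is not stated in the text), and the function names are hypothetical. The 13 kbits/sec figure is the GSM rate quoted earlier.

```python
SAMPLE_RATE_HZ = 8000  # assumption: telephone-band sampling rate
FRAME_MS = 20          # frame length given in the text
SUBFRAME_MS = 5        # sub-frame length given in the text


def split_frames(samples, sample_rate=SAMPLE_RATE_HZ):
    """Split samples into whole 20 ms frames of four 5 ms sub-frames.

    Trailing samples that do not fill a whole frame are ignored.
    """
    frame_len = sample_rate * FRAME_MS // 1000   # 160 samples at 8 kHz
    sub_len = sample_rate * SUBFRAME_MS // 1000  # 40 samples at 8 kHz
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        frames.append([frame[i:i + sub_len]
                       for i in range(0, frame_len, sub_len)])
    return frames


def bits_per_frame(bit_rate_bps, frame_ms=FRAME_MS):
    """Bits available to code one frame at the given bit rate."""
    return bit_rate_bps * frame_ms // 1000
```

At 13 kbits/sec, `bits_per_frame(13000)` gives 260 bits to carry the eight short term coefficients, the four sets of long term predictor parameters, and the quantized residual for each frame.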

The Voxware codec (of Voxware, Inc., Princeton, N.J.) is
commercially available in a number of varieties, including
VR (variable rate), RT (real time) and SC (scalable). The
VR codecs classify individual speech frames into one of four
classes: silence, unvoiced, mixed voicing and fully voiced,
and each of these classes uses a different scheme for speech
parameter transmission.

A concrete example is shown in FIG. 11(a). Assume the
voice port begins with the commercially available
TrueSpeech codec algorithm (of DSP Group, Santa Clara,
Calif.), which encodes speech at 8.5 kbits/second and with
no redundancy. A stream of voice data 200 includes a
plurality of data packets numbered 1 through 10, where each
packet further contains a plurality of data bytes indicated by
the letters in FIGS. 8(a) to (d). A plurality of packets 210 (i.e.
packets 1 through 4) in the stream 200 of voice data are
illustrated to have a format T0, where "T" designates
TrueSpeech and "0" indicates the level of redundancy. After
noticing dropped packets, the voice port adjusts by selecting
the Voxware 2.9 kbits/second algorithm having somewhat
lower sound quality, but with two level redundancy error
correction. Level two redundancy Voxware includes two 2.9
kbits/second algorithms, which together still total
approximately 6 kbits/second. These are illustrated as another
plurality of packets 220 (i.e. packets 5 through 10) labelled
with the format V2, where "V" represents a particular type of
Voxware codec and "2" is the level of redundancy. Thus, it is
possible to change the redundancy and the codec to correct
for dropped packets and utilize the same amount of Internet
bandwidth. Fault tolerance in the voice transmission data is
thereby achieved.

FIGS. 11(b) and (c) show how the voice port 61 performs
a codec selection on the voice data stream 200 of FIG. 11(a) 
to maintain speech quality. Voice port 61 has speech quality 
detector 221 and codec selector 222 modules. A first speech 
packet (packet #1 in FIG. 11(a)) enters voice port 61, and 
speech quality of this packet is detected by speech quality 
detector 221. The quality of packet #1 is determined to be 
acceptable by the speech quality detector module 221 since 
it is above the baseline B in FIG. 11(c). Accordingly, codec 
selector module 222 maintains the codec and redundancy as
"T0" for packet #2. This continues until speech quality
detector 221 determines that the speech quality of packet #4 
is unacceptable; the speech quality falls below baseline B 
due to changing network conditions. Codec selector 222 
responsively changes both the codec and the redundancy for 
packet #5 to "V2." FIG. 11(c) shows that level two redun- 
dancy Voxware for packets #5 through #10 produces an 
acceptable speech quality. Thus, voice port 61 responds to 
changing network conditions to maintain speech quality. 
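
The selection behavior walked through above can be sketched as a small loop. This is a hedged illustration, not the voice port's actual logic: the quality scores, the baseline value, the class and function names, and the rule that a level-N redundancy stream costs N times the codec rate (with level 0 costing one times the rate) are all assumptions chosen to match the T0 and V2 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Format:
    codec: str       # "T" for TrueSpeech, "V" for Voxware
    redundancy: int  # redundancy level carried by each packet
    kbps: float      # nominal codec bit rate

    @property
    def label(self):
        return f"{self.codec}{self.redundancy}"

    @property
    def total_kbps(self):
        # Assumption: level N redundancy sends N copies' worth of data,
        # and level 0 sends a single copy.
        return self.kbps * max(1, self.redundancy)


T0 = Format("T", 0, 8.5)
V2 = Format("V", 2, 2.9)


def select_formats(quality_per_packet, baseline, primary=T0, fallback=V2):
    """Return the format label used for each packet in order.

    The quality measured on packet k decides the format of packet k+1.
    Once quality falls below the baseline the fallback format is kept
    (switching back is not modeled in this sketch).
    """
    current = primary
    labels = []
    for quality in quality_per_packet:
        labels.append(current.label)
        if quality < baseline:
            current = fallback
    return labels
```

With hypothetical scores in which only packet #4 falls below the baseline, the result is four T0 packets followed by six V2 packets, and the V2 stream (2 x 2.9 = 5.8 kbits/second) stays within the 8.5 kbits/second used by T0.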

It is also possible to vary the size of the individual packets 
or to vary the bundling of the packets by techniques that are
well known in the art. The voice port therefore tolerates 
faults in the data stream, while the standard procedure for 
Transmission Control Protocol (TCP) on the Internet is to 
request a retransmission of the data. 

Another important characteristic of the voice port is that 
it permits codec encapsulation so that the higher level 
software is functionally independent of the lower level 
codec software. The codecs are therefore essentially objects 
and neither the transport nor any of the other software needs 
to be compatible with any particular codec. As new codecs 
are introduced, they can be added easily without
requiring modifications in the higher level system software. 
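
A minimal sketch of this kind of encapsulation is shown below. The interface, the registry, the packet layout, and the toy `Pcm8Codec` are hypothetical and serve only to illustrate the idea that higher level code never depends on a particular codec.

```python
from abc import ABC, abstractmethod


class Codec(ABC):
    """Common interface every codec object exposes to higher level code."""

    name = ""

    @abstractmethod
    def encode(self, samples):
        """Turn a list of samples into a bytes payload."""

    @abstractmethod
    def decode(self, payload):
        """Turn a bytes payload back into a list of samples."""


class CodecRegistry:
    """Lookup table from codec name to codec object."""

    def __init__(self):
        self._codecs = {}

    def register(self, codec):
        self._codecs[codec.name] = codec

    def get(self, name):
        return self._codecs[name]


class Pcm8Codec(Codec):
    """Toy stand-in codec: one unsigned byte per sample, clamped to 0..255."""

    name = "PCM8"

    def encode(self, samples):
        return bytes(max(0, min(255, int(s))) for s in samples)

    def decode(self, payload):
        return list(payload)


def send_packet(registry, codec_name, samples):
    # The codec name travels with the payload, so the packet describes
    # how it must be decoded.
    codec = registry.get(codec_name)
    return {"codec": codec_name, "payload": codec.encode(samples)}


def receive_packet(registry, packet):
    return registry.get(packet["codec"]).decode(packet["payload"])
```

Adding a new codec then amounts to writing one more `Codec` subclass and registering it; `send_packet` and `receive_packet` stay unchanged.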

An alternative embodiment of the invention adjusts for 
dead time in a speech message by time warping the speech 
at a constant pitch. Generally, once the integrity of the data 
stream is guaranteed by the error correction algorithm, at 
least part of the data stream waits in a buffer on the receiving 
side of the server until it is emptied to the receiver. However, 
when there is no data left in the buffer, there is a danger that 
dead air time can occur, in which the listener hears a gap or 
blank in the transmission. Therefore, the software further 
contains a utility that senses when the data buffer becomes 
depleted, and stretches the data reaching the ear of the 
listener in a manner opposite to the technique utilized in 
television commercials and radio voiceovers to speed up the 
data rate. Effectively, the algorithm contains a lever that
measures the number of packets in the buffer and, without
changing pitch, speeds up or slows down the data rate
according to the pool depth. The
Voxware codec particularly supports this algorithm to 
specify the degree of time warp. 

FIG. 12 illustrates conceptually how the time warping 
speech algorithm is implemented. Buffer 90 contains voice 
data 91 from a data stream 92 of an incoming call indicated 
by a level 93 inside buffer 90. This voice data 91 is draining 
through a faucet 94 from buffer 90 to the receiver. Buffer 90 
is shown to be on a seesaw scale 95 so that when buffer 90 
is full of voice data 91, arrow 96 on the other end of seesaw
95 is high; when buffer 90 is nearly empty of voice data 91,
arrow 96 is low. Arrow 96 points to a scale 97 sliding
between fast (F) and slow (S). Output from sliding scale 97
goes to feedback loop 98 controlling faucet 94. Thus, arrow
96 points toward F when the buffer level 93 is full, and
feedback loop 98 increases the rate at which the voice data
91 is draining through faucet 94 from buffer 90; when buffer
level 93 is nearly empty of voice data 91, arrow 96 points
towards S and the feedback loop 98 makes the voice data
drain more slowly through faucet 94. Simultaneously, the 
algorithm maintains a pitch for the voice data 91 that is 
constant and independent of the rate of draining of the buffer 
90. 
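
The feedback loop of FIG. 12 can be sketched as a mapping from buffer fill to a playback rate. The linear mapping, the 0.8 and 1.2 rate limits, and the function name are assumptions made for illustration; the pitch-preserving time stretch itself is not implemented here.

```python
def playback_rate(buffer_level, capacity, slow=0.8, fast=1.2):
    """Map buffer fill onto a drain rate between slow and fast.

    A nearly empty buffer yields the slow rate and a full buffer yields
    the fast rate. Fill is clamped to the range 0..1, and the mapping
    in between is linear (an assumed shape, not taken from the text).
    """
    fill = max(0.0, min(1.0, buffer_level / capacity))
    return slow + (fast - slow) * fill
```

For example, with a capacity of 10 packets, a half-full buffer gives a rate of about 1.0 (normal speed), while an empty buffer gives the slow limit so the remaining speech is stretched rather than followed by dead air.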

The invention has been described in general terms accord- 
ing to the preferred embodiments. However, those of ordi- 
nary skill in the art will understand that certain modifications 
or changes may be made to the disclosed embodiment 
without departing from the essential nature of the invention. 
Therefore, the scope of the invention is to be limited only by 
the following claims. 

We claim: 

1. A software method of choosing from a plurality of 
codecs in an Internet telephone system, said method com- 
prising the steps of: 

receiving a plurality of self-describing data packets in a 
voice data stream on a receiving end; 

acquiring a voice quality measurement from said self- 
describing data packets received at said receiving end; 
and 

dynamically changing codec algorithms in response to 
said voice quality measurement on a packet-to-packet 
basis for each packet in said plurality of self-describing
data packets for optimizing the voice quality of the 
information contained in each said packet. 

2. The software method of claim 1, further comprising: 
varying the length of said packets. 

3. The software method of claim 1, further comprising: 
applying data redundancy to said packets. 

4. The software method of claim 1, further comprising: 
varying the bundling of said packets. 

5. The software method of claim 1, further comprising the 
steps of: 

detecting a quantity of voice data waiting in a voice input 
buffer; 

regulating the rate of removal of said voice data from said
buffer based upon said quantity from a first speed to a
second speed; and

maintaining a constant pitch for said voice data as heard
by a caller as the rate changes from said first speed to
said second speed.

6. The software method of claim 5, wherein said regulat- 
ing step further comprises: 

slowing down the rate of removal of said voice data for
low quantities; and

speeding up the rate of removal for high quantities.

7. A software method of choosing from a plurality of 
codecs in an Internet telephone system, said method com- 
prising the steps of: 

sending a plurality of self-describing data packets in a 
voice data stream on a sending end; 

acquiring a voice quality measurement from said self- 
describing data packets received at a receiving end; and 

dynamically changing codec algorithms in response to 
said voice quality measurement on a packet-to-packet 
basis for each packet in said plurality of self-describing
data packets for optimizing the voice quality of the 
information contained in each said packet. 

8. The software method of claim 7, further comprising: 
varying the length of said packets. 

9. The software method of claim 7, further comprising: 
applying data redundancy of said packets. 



10. The software method of claim 7, further comprising: 
varying the bundling of said data packets. 

11. A software system for choosing from a plurality of 
codecs in an Internet telephone system, comprising: 

a gateway server for receiving a plurality of self- 
describing data packets in a voice data stream on a 
receiving end; and 
a voice port in said gateway server for acquiring a voice 
quality measurement from said self-describing data 
packets received by said gateway server, and 
dynamically changing codec algorithms in response to 
said voice quality measurement on a packet-to- 
packet basis for each packet in said plurality of 
self-describing data packets for optimizing the voice 
quality of the information contained in each said 
packet. 

12. The software system of claim 11, wherein: 
said voice port varies the length of said packets. 

13. The software system of claim 11, wherein: 

said voice port applies data redundancy to said packets. 

14. The software system of claim 11, wherein: 
said voice port varies the bundling of said packets. 

15. The software system of claim 11, further comprising: 
a software utility for detecting a quantity of voice data
waiting in a voice input buffer;
said utility regulates the rate of removal of said voice data
from said buffer based upon said quantity from a first
speed to a second speed; and
said utility maintains a constant pitch for said voice data
as heard by a caller as the rate changes from said first
speed to said second speed.

16. The software system of claim 15, wherein: 

said utility slows down the rate of removal of said voice 
data for low quantities of said data in said buffer and 
speeds up the rate of removal for high quantities. 

17. A software system for choosing from a plurality of 
codecs in an Internet telephone system, comprising: 

a gateway server for receiving a plurality of self- 
describing data packets in a voice data stream on a 
receiving end; and 
a voice port in said gateway server for 

acquiring a voice quality measurement from said self- 
describing data packets received by said gateway 
server, 

comparing said voice quality measurement to a numeri-
cal baseline, and

dynamically changing codec algorithms in response to
said comparison of said voice quality measurement
to said numerical baseline on a packet-to-packet
basis for each packet in said plurality of self-
describing data packets for optimizing the voice
quality of the information contained in each said
packet.

18. The software system of claim 17, wherein: 
said voice port varies the length of said packets.

19. The software system of claim 17, wherein: 

said voice port applies data redundancy to said packets. 

20. The software system of claim 17, wherein: 
said voice port varies the bundling of said packets. 

21. The software system of claim 17, wherein said codec
algorithm is changed if said voice quality measurement is
less than said numerical baseline and said codec algorithm
is not changed if said measurement is greater than or equal
to said numerical baseline.

22. The software system of claim 17, wherein said codec
algorithm is changed if said voice quality measurement is
greater than or equal to said numerical baseline and said
codec algorithm is not changed if said voice quality mea-
surement is less than said baseline.

23. A software system for choosing from a plurality of
codecs in an Internet telephone system, comprising:

means for receiving a plurality of self-describing data
packets in a voice data stream on a receiving end;

means for acquiring a voice quality measurement from
said self-describing data packets received at said
receiving end;

means for comparing said voice quality measurement to a
numerical baseline; and

means for dynamically changing codec algorithms in
response to said comparison of said voice quality
measurement to said numerical baseline on a packet-
to-packet basis for each packet in said plurality of
self-describing packets for optimizing the voice quality
of the information contained in each said packet.

24. The software system of claim 23, further comprising: 
means for dynamically varying the length of said packets. 

25. The software system of claim 23, further comprising: 
means for applying data redundancy to said packets. 

26. The software system of claim 23, further comprising:
means for varying the bundling of said packets. 

* * * * *